ptairMS 0.1.4
library(ptairMS)
library(ptairData)
The ptairMS package provides a workflow to process PTR-TOF-MS raw data in the open Hierarchical Data Format 5 (HDF5; .h5 extension), and generate the peak table as an ExpressionSet object for subsequent data analysis with the many methods and packages available in R.
We present in this vignette a quick handling of this package, and application to two dataset : exhaled air, mycobacteria culture air. For more details on the parameters choice and algorithm, see ptairMS vignette.
ptairData packageTwo real PTR-TOF-MS (Proton Transfert Reaction Time Of Flight) raw data sets are available on the ptairData package, from
exhaled air of two healthy individal, three acquisitions per individual on diffrents day with two exprirations
mycobateria culture air, two replicates of each two diffrent species and one control (culture medium)
For memory reasons, the mass axis is truncate at the range: [20.4,21.6] U [50 , 150] for human data set, and [20.4 ; 21.6] U [56.4 ; 90.6] for mycobacteria.
To process several files, a ptrSet object is first generated with the function createPtrSet from the directory containing the raw files, possibly grouped into subfolders according to classes of samples. The object will contain all information about the study: the sample metadata, one peak list for each samples, and several quality metrics about the files. Second, the peak lists are aligned between samples and an ExpressionSet (Biobase package object) object is returned , containing the peak table, sample metadata, and feature metadata.
We start from a directory :
dirRaw <- system.file("extdata/exhaledAir", package = "ptairData")
exhaledSet <- createPtrSet(dirRaw,
setName="exhaledSet",
mzCalibRef = c(21.022, 60.0525,75.04406),
fracMaxTIC = 0.9,
saveDir = NULL )
## ind1-1.h5 check
## ind1-2.h5 check
## ind1-3.h5 check
## ind2-1.h5 check
## ind2-2.h5 check
## ind2-3.h5 check
This function performs for each files :
calibration with mzCalibRef parameters
expiration detection, that correspond to the part of Total Ion Chromatogram where the intensity is higher than fracMaxTIC * max(TIC)
quantify th eprimary ion on the isotope H3(O18)+
provides a default sampleMetadata based on the tree structure of the directory
A summary plot to check the calibration error and reference calibration peak average shape:
plot(exhaledSet)
Interactive plot to see Total Ion Spectrum and expirations limits:
plotTIC(exhaledSet,showLimits = T,baselineRm = T)
After creating the ptrSet, we porcess the files with detectPeak function.
For each file in the data set, this function :
calibrates each expiration and ambiant air
detects peak in each expiration and ambiant air
aligns features between the background and diffrents expirations
keeps features present in at least fracGroup % of expirations
if normalize=TRUE, normalizes with the primary ion (devided all features by the quantity of H3(O18)+ *488), and converts in ppb if the ptr reaction information are available in the h5 file (voltage, temperature and pressure of the drift)
exhaledSet <- detectPeak(exhaledSet)
## ind1-1.h5 : 125 peaks detected
## ind1-2.h5 : 133 peaks detected
## ind1-3.h5 : 117 peaks detected
## ind2-1.h5 : 133 peaks detected
## ind2-2.h5 : 142 peaks detected
## ind2-3.h5 : 132 peaks detected
This function can take few times if there are many files and a large mz range. You can then parallelize the algorithm by set parallel = TRUE and give the number of cores in nbCores (the number of files that will be processed simultaneous). To see the number of CPU cores available in your computer, use parallel::detectCores().
Once the files have been all processed, we then align samples (i.e. matching of variables between the peak lists of the samples) with alignSamples function. It generate an expressionSet object, that contains the resulting peak table, the sample meta data (from ptrSet) and variable meta data containing the background peak table.
It is possible to apply two filters:
On the background: only variables with intensities greater than bgThreshold * background are kept
On samples reproducibility: only variables which are detected in more than \(fracGroup\) % of samples are kept.
If you do not want to apply those filters, set fracGroup and bgThreshold to 0.
eSet <- alignSamples(exhaledSet, group="subfolder", fracGroup = 1, bgThreshold = 4 )
## 33 peaks aligned
To imputue missing values, ptairMS returns back to the raw data, and compute with the same method as detectPeak the missing values without limit of detection.
eSet <- ptairMS::impute(eSet, exhaledSet)
## ind1-1.h5 done
## ind1-2.h5 done
## ind1-3.h5 done
ptairMS propose a putative annotation based on the database saved in 'inst/extdata/reference_tables/vocDB_tables. The function write in the features metadata all possible formulas and names found for each features
eSet<-annotateVOC(eSet)
You can export the expressionSet with the function writeEset into 3 tabulated files ‘dataMatrix.tsv’, sampleMetadata.tsv’, and ‘variableMetadata.tsv’:
writeEset(eset, dirC = file.path(getwd(), "processed_dataset"))
You can finally use known package as ropls to performs PCA or OPLS analysis.
X<-Biobase::exprs(eSet)
SMD<-Biobase::pData(eSet)
features<-Biobase::fData(eSet)
ropls::view(log2(X),printL = FALSE)
pca<-ropls::opls(t(log2(X)),crossvalI=5,info.txtC="none",fig.pdfC="none")
plot(pca,type="x-score",parAsColFcVn=SMD$subfolder)
dim1<-ropls::getLoadingMN(pca)[,1]
barplot(sort(abs(dim1),decreasing = T))
knitr::kable(head(features[names(sort(abs(dim1),decreasing = T)),c("vocDB_formula_Hplus","vocDB_cas.name")]))
| vocDB_formula_Hplus | vocDB_cas.name | |
|---|---|---|
| 67.0538 | [C5H6+H]+ | 1,3-cyclopentadiene |
| 70.0732 | ||
| 69.0699 | [C5H8+H]+, [H8C5+H]+ | (E)-, 1,3-pentadiene, 3-methyl-1-butyne, cyclopentene, isoprene, pentyne ui |
| 109.0666 | [C7H8O+H]+ | 1-hydroxy-3-methylbenzene, 4-methylphenol , benzyl alcohol, p-cresol |
| 68.0598 | ||
| 138.1339 |
plotFeatures plot for all files the average spectrum baseline corrected of all expirations, and background superimposed as a dashed line.
plotFeatures(exhaledSet,mz=51.0442,colorBy="subfolder",typePlot = "ggplot")
plotFeatures(exhaledSet,mz=67.0539,colorBy="subfolder",typePlot = "ggplot")
plotFeatures(exhaledSet,mz=137.131,colorBy="subfolder",typePlot = "ggplot")
dirRaw <- system.file("extdata/mycobacteria", package = "ptairData")
bacteriaSet <- createPtrSet(dirRaw,
setName="bacteriaSet",
mzCalibRef = c(21.022, 60.0525),
fracMaxTIC = 0.7,
saveDir = NULL )
## Control1.h5 check
## Control2.h5 check
## Specie-a1.h5 check
## Specie-a2.h5 check
## specie-b1.h5 check
## specie-b2.h5 check
plot(bacteriaSet)
plotTIC(bacteriaSet,showLimits = T,baselineRm = T)
bacteriaSet <- detectPeak(bacteriaSet)
## Control1.h5 : 37 peaks detected
## Control2.h5 : 37 peaks detected
## Specie-a1.h5 : 41 peaks detected
## Specie-a2.h5 : 37 peaks detected
## specie-b1.h5 : 50 peaks detected
## specie-b2.h5 : 40 peaks detected
eSet <- alignSamples(bacteriaSet, group="subfolder", fracGroup = 1, bgThreshold = 4 )
## 19 peaks aligned
eSet <- ptairMS::impute(eSet, bacteriaSet)
## specie-b1.h5 done
eSet<-annotateVOC(eSet)
You can know use known package as ropls to performs PCA or OPLS analysis.
X <- Biobase::exprs(eSet)
SMD <- Biobase::pData(eSet)
features <- Biobase::fData(eSet)
ropls::view(log2(X),printL = FALSE)
pca<-ropls::opls(t(log2(X)),crossvalI=5,predI=2,info.txtC="none",fig.pdfC="none")
plot(pca,type="x-score",parAsColFcVn=SMD$subfolder,parEllipsesL=F)
dim1<-ropls::getLoadingMN(pca)[,1]
barplot(sort(abs(dim1),decreasing = T))
knitr::kable(head(features[names(sort(abs(dim1),decreasing = T)),c("vocDB_formula_Hplus","vocDB_cas.name")]))
| vocDB_formula_Hplus | vocDB_cas.name | |
|---|---|---|
| 57.07 | [C4H8+H]+, [H8C4+H]+ | (E)-, (Z)-, 2-butene, 2-methyl-1-propene, but-1-ene, butene ui, cyclobutane |
| 58.0722 | ||
| 87.0795 | [C5H10O+H]+, [H10C5O+H]+ | 2-methylbutanal, 2-pentanone, 3-methyl-2-buten-1-ol, 3-methyl-3-buten-1-ol, 3-methylbutanal, 3-pentanone, E-2-penten-1-ol, methylbutanal, pentanal |
| 74.0661 | ||
| 73.0649 | [C4H8O+H]+ | 2-butanone, 2-buten-1-ol, 2-methylpropanal, butanal, tetrahydrofuran |
| 85.1027 | [C6H12+H]+ | 1-hexene, 2-methyl-1-pentene, cyclohexane, methylcyclopentane |
plotFeatures(bacteriaSet,mz=63.0299,colorBy="subfolder",typePlot = "ggplot")
plotFeatures(bacteriaSet,mz=77.0607,colorBy="subfolder",typePlot = "ggplot")
plotFeatures(bacteriaSet,mz=87.0795,colorBy="subfolder",typePlot = "ggplot")